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Abstract 

In many longitudinal microarray studies, the gene expression levels in a random sam- 
pie are observed repeatedly over time under two or more conditions. The resulting time 
I— I courses are generally very short, high-dimensional, and may have missing values. More- 

over, for every gene, a certain amount of variability in the temporal profiles, among 
biological replicates, is generally observed. We propose a functional mixed-effects model 
for estimating the temporal pattern of each gene, which is assumed to be a smooth func- 
^ tion. A statistical test based on the distance between the fitted curves is then carried 

out to detect differential expression. A simulation procedure for assessing the statistical 
power of our model is also suggested. We evaluate the model performance using both 
^— I simulations and a real data set investigating the human host response to BCG exposure. 

> 

1 Introduction 

00 

In a longitudinal microarray experiment, the gene expression levels of a group of biological 
04 replicates - for example human patients - are observed repeatedly over time. A typical 

study might involve two or more biological groups, for instance a control group versus a 
drug-treated group, with the goal to identify genes whose temporal profiles differ between 
• . them. It can be challenging to model these experiments in such a way that accounts 

_ ^ for both the within-individual (temporal) and between-individual correlation - failure to 

do so may lead to poor parameter estimates and ultimately a loss of power and incorrect 
^ inference. Further challenges are presented by the small number of time points over which 

^ the observations are made, typically fewer than 10, the high dimensionality of the data 

with many thousands of genes studied simultaneously, and the presence of noise, with 
many missing observations. 

In order to address these issues we present here a functional data analysis (FDA) 
approach to microarray time series analysis. In the FDA paradigm we treat observations 
as noisy realisations of an underlying smooth function of time which is to be estimated. 
These estimated functions are then treated as the fundamental unit of observation in the 
subsequent data analysis as in |11| . Similar approaches have been used for the clustering 
of time series gene expression data without replication [2 [5] but these cannot be applied 
to longitudinal designs such as the one described in Section [2j Our approach is much more 
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closely related to, and can be considered a generalisation of, the EDGE model presented 
by [12, 

The rest of this paper is organised as follows. Our motivating study is introduced in 
Section [2] In Section |3] we present our methodology based on functional mixed-effects 
models. A simulation study is discussed in Section [4] where we compare our model to 
EDGE. Section [5] provides a brief summary of our experimental findings. 

2 A case study: tubercolosis and BCG vaccination 

Tuberculosis (TB) is the leading cause of death world-wide from a curable infectious dis- 
ease. In 2006 it accounted for over 9 million new patients and over 2 million deaths; these 
figures are in spite of effective medication and a vaccine being available since 1921. This 
discrepancy is due in part to the HlV-epidemic, government cutbacks, increased immi- 
gration from high prevalence areas and the development of multi-drug resistant strains of 
the disease but ultimately due to our limited understanding of the complex interaction 
between the host and the pathogen M. tuberculosis. In particular, it has been a longstand- 
ing observation that the BCG vaccine conveys different levels of protection in different 
populations A total of 17 controlled trials of the efficacy of BCG vaccination have 
been carried out and efficacy has varied between 95% and 0%; some studies even show a 
negative effect P]. 

The purpose of this case study was to characterise the host response to BCG exposure 
by using microarrays to identify genes which were induced or repressed over time in the 
presence of BCG. 9 children with previous exposure to TB but who were then healthy, 
having completed TB therapy at least 6 months prior to the study were recruited from 
the Red Cross Children's Hospital Wellcome TB research database, matched for age and 
ethnicity. A complete description of the experimental procedures will be reported in a 
separate publication. In summary, each child contributed a BCG treated and a BCG 
negative control time series observed at 0, 2, 6, 12 and 24 hours after the addition of BCG 
or, in the case of the controls, 100/il PBS. A two-colour array platform - the Stanford 
'lyniphochip' - was used. Data prepocessing and quality control were performed using the 
GenePix4.0 software and in R using BioConductor (www.bioconductor.org). Figure [l] 
shows 9 biological replicates that have been observed for the TNF (tumor necrosis factor) 
gene, from which three typical features of the longitudinal data under analysis can be 
noted: (a) all time series are short and exhibit a clear serial correlation structure; (b) a 
few time points are missing (for instance, individual 8 has only 4 time points); (c) there 
is variability in the gene expression profiles across all individuals. 

3 Mixed-effects smoothing splines models 

Each observation being modelled is denoted by y{tij) and represents the gene expression 
measure observed on individual i at time tij, where i = 1, 2, . . . , n^, j — 1, 2, . . . , m^, Uk 
denotes the sample size in group k and rrii is the number of observations on individual i. 
In order to properly account for the features observed in Figure [l] we suggest to model 
y{tij) non-parametrically: 

y{tij) = n{tij) + Vi{tij) + dj (1) 

The model postulates that the observed gene expression measure y{t.ij) can be explained 
by the additive effect of three components: a mean response /i(-), which is assumed to 
be a smooth, non-random curve defined over the time range of interest; an individual- 
specific deviation from that mean curve, Ui(-), which is assumed to be a smooth and 
random curve observed over the same time range; and an error term which accounts 
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Figure 1: An example of 9 individual gene expression profiles (biological replicates) for the 
TNF gene. The experimental setup is described in Section [2] Shown here are the raw data, 
before model fitting. Some of the peculiar features of the data can be observed here: (i) very 
short temporal profiles, (ii) irregularly spaced design time points, (iii) missing data, and (iv) 
individual variability. 
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for the variabihty not explained by the first two terms. Formally, we treat each Vi{tij), for 
i — 1,2, . . . ,nfc, as independent and identically distributed realisations of an underlying 
stochastic process; specifically, we assume that Vi{tij) is a Gaussian process with zero- 
mean and covariance function 7(3, t), that is Vi{tij) ~ GP(0,7). The errors terms e^j 
are assumed to be independent and normally distributed with zero mean and covariance 
matrix R^. We do not assume that all individual have been observed at the same design 
time points, and all the distinct design time points are denoted by (ti, T2, . . . , t™). 

We suggest to represent the curves using cubic smoothing splines; see, for instance, 
[5]. The key idea of smoothing splines consists in making full use of all the design time 
points and then fitting the model by adding a smoothness or roughness constraint; by 
controlling the size of this constraint, we are then able to avoid curves that appear too 
wiggly. A natural way of measuring the roughness of a curve is by means of its integrated 
squared second derivative, assuming that the curve is twice-differentiable. We call fj — 
{riiri), . . . ,'ri{Tjn))'^ the vector containing the values of the mean curve estimated at all 
design time points and, analogously, the individual-specific deviations from the mean 
curve, for individual i, are collected in Vi = (ui(ri), . . . , Vi(Tm))'^ ■ The mean and individual 
curves featuring in model can be written as, respectively, ^i{tij) = xjjfj and Vi{tij) = 
xfjVi, with i = l,2,...,n, and Xij = {xiji, . . . , Xijm)'^ , with Xijr = 1 if i^j — Tj., r — 
1, . . . ,TO and Xijr = otherwise. The fact that the individual curves are assumed to be 
Gaussian processes is captured by assuming that the individual deviations are random and 
follow a zero-centred Gaussian distribution with covariance D, where D{r,s) — 7(Ts,Tr), 
r, s — 1, . . . ,m. Finally, in matrix form, model ([ij can then be rewritten as 

Tji = X^fj + XSi + e* (2) 

For simplicity, we assume that Ri = cr^/. In this form, we obtain a linear mixed-effects 
model [6,. Clearly, the model accounts for the fact that, for a given gene, the individual 
repeated measurements are correlated. Specifically, under the assumptions above, we have 
that coY{yi) = X^DXf + Ri. 

3.1 Statistical inference 

A common approach to estimating the unknown parameters of a linear mixed-effects 
model is by maximum likelihood (ML) estimation. In our model ([2]), the twice negative 
logarithm of the (unconstrained) generalised log-likelihood for is given by 

Y.[{y.- X,rf- XAfRr\y, - X.rj- X,v,) + log | ^ | +vJd-^v, + log | i?, |} . 

1=1 

The ML estimators of the mean curve ^{■) and each individual curve Vi{-) can be 
obtained by minimising a penalised version of this log-likelihood obtained by adding a 
term Xff^Gff and a term A„ |^ which impose a penalty on the roughness of 

the mean and individual curves, respectively. The matrix G is the roughness matrix that 
quantifies the smoothness of the curve [S] whilst the two scalars Xy and A are smooth- 
ing parameters controlling the size of the penalties. In principle, Uk distinct individual 
smoothing parameters can be introduced in the model but such a choice would incur a 
great computational cost during model fitting and selection. For this reason, we assume 
that, for each given gene, all individual curves share the same degree of smoothness and 
we use only one smoothing parameter Xy. 

After a rearrangement on the terms featuring in the penalised log-likelihood, the model 
can be re-written in terms of the regularised covariance matrices, Dy = (D~^ + XyG)~^ 
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and Vi — XiD^Xf + Ri. When both these variance components are known, the ML 
estimators fj and v^, for i = 1, 2, . . . , rifc, can be derived in closed-form as the minimisers of 
the penaHsed generahsed log-hkelihood. However, the variance components are generally 
unknown. All parameters can be estimated iteratively using an EM algorithm, which 
begins with some initial guesses of the variance components. The smoothing parameters 
A and A„ are found as those values, in the two-dimensional space (A x A^,), that optimise 
the corrected AIC, which includes a small sample size adjustment. The search for optimal 
smoothing values A and A„ is performed using a downhill simplex optimisation algorithm 

The objective of our analysis is to compare the estimated mean curves observed under 
the two experimental groups and assess the null hypothesis that the curves are the same. 
After fitting model ^ to the data, independently for each group and each gene, we obtain 
the estimated mean curves fi and A'^HO- One way of measuring the dissimilarity 
between these two curves consists in computing the L2 distance between them, which can 
be evaluated using the smoothed curves jl'^^^t) and fL^^^{t), thus yielding the observed 
distance D. We use this dissimilarity measure as a test statistic. Since the null distribution 
of this statistic is not available in closed form, we resort to a non-parametric bootstrap 
approach in order to approximate it. 




Figure 2: An example of simulated longitudinal data and fitted curves using both MESS and 
EDGE. The thick solid lines correspond to the fitted means for each group. The dotted lines 
are the fitted individual curves for group 1 and the dashed lines are the fitted individual curves 
for group 2. 



4 Performance assessment using simulated longitudinal 
data 

In order to assess the performance of the proposed MESS model we compared it using a 
simulation study to the EDGE model developed by [12 . While the EDGE model takes the 
same form as ([ij, their parameterisation differs from ours in that the mean function /i(-) is 
represented using B-splines and the individual curves Ui(-) are constrained to be a scalar 
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shift. In the case of the mean curve, the B-sphne representation requires specification 
of both the number and location of the knots which, unhke smoothing spHnes, offers 
discontinuous control over the degree of smoothing. Furthermore [12J represent each gene 
using the same number of basis functions which, if taken to be too small, implies a poor 
fit to those genes with the most complex temporal profiles. Conversely, if the number 
of basis functions is sufficient to model these complex genes there is a danger that some 
genes will be overfit. In the case of the individual curves Wi(-), it should be clear that 
scalar shifts would be unable to fully capture the individual variability we observe in the 
raw data given in Figure [l] This problem is compounded by the fact that [12J propose an 
F-type test statistic for inference which makes use of the model residuals. 

To determine the practical impact of these features we have set up a simulation pro- 
cedure that generates individual curves that look similar to the real experimental data. 
Our procedure is based on a mixed-effects model with the fixed- and random-effects pa- 
rameterized using B-splines, where the observations on individual i belonging to group j 
are given as 

yi = Xil3j + Zibij + Sij (3) 
- MVN(0, 3, ) el, ^ MVN(0, a, x ) 

where i = 1, . . . , n and j — 1,2. For simplicity, we use the same basis for the fixed- and 
random-effects so that Xi ^ Zi. The parameters that need to be controlled in this setting 
therefore consist of the variance components <Ji,a2, Di , D2 , the B-spline parameters for 
the group means /3i, P2, and the B-spline basis Xi which is determined by the number 
and locations of the knots, K. Given the simple patterns often observed in real data 
sets, we place a single knot at the center of the time course. Wherever possible, we tune 
the variance components based on summary statistics computed from data produced in 
real studies such as the experiment described in Section [2] We further parameterise the 
covariance matrices as 
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and introduce the notation D{t,p) for specifying these parameters. In this way we can 
vary the complexity of the individual curves by varying the parameter p and control 
the amount of variation between individuals by varying r. When p = 1, the individual 
'curves' are scalar shifts, as in the EDGE model. As p tends to 0, D tends to tI, where 
the B-spline parameters hi are independent. 

We begin simulating a given gene by first randomly generating the B-spline co-efficients 
for the mean curve of group 1, /3i, and for each individual belonging to this group, 
bii, according to the following probability distributions f3i ~ MVN(0, Z5^) and bn ^ 
MVN(0, Dbi), with covariance matrices given by = i5(0.25, 0.6) and D^-^ = ^(t^j , 0.6), 
where Tf,j ^ [/(0. 1,0.2). As in (|3|, the error term is normally distributed with variance 
(Ti. We set this variance component to be log-normally distributed with mean —2 and 
variance 0.35, values estimated from the real data. 

Each simulated gene is differentially expressed with probability 0.1. If a gene is not 
differentially expresed then observations are generated for group 2 using exactly the same 
parameters as for group 1, i.e. /3i = /32, Dh-^ = Di,2, ci = cr2- On the other hand, if a gene is 
differentially expressed, then we randomly generate a vector /3s representing the difference 
in B-spline parameters for the group mean curves, distributed as f3s ^ MVN(0, -D^^) and 

Dj^^ = £'(0.25,0.9), with /3si = 0. We then normalise the vector fig so that its L2-norm is 



6 



1 before setting P2 — Pi + Ps- By setting j3si = we ensure that both mean curves began 
the time course at the same value, which we have observed in the real data and would 
similarly be the case if the data had been t = transformed. Setting p = 0.9 for D^^ 
limits the difference between the curves in terms of complexity which, again, we observe 
in the real data where frequently the mean curves are simply vertically shifted versions 
of each other. Normalising the vector enables us to control exactly how large an effect 
size we are trying to detect by multiplying the vector by a scaling factor. 

Finally, we generate the individual curves for group 2 for a differentially expressed 
gene as before: 6,2 ^ MVN(0,i5fcJ and Db,_ = D{n^,O.Q), where ~ f/(0.1,0.2). The 
key point to note is that t^^ ^ r^j^ . By doing so, a differentially expressed gene varies 
both in terms of the mean curve and the degree of individual variation. Similarly, (72 is 
distributed identically to yet independently of cri so that the noise of the two groups is 
also different. 




Figure 3: ROC curve comparing the performance between the EDGE and MESS mo 
Results are based on 100, 000 simulated genes as described in Section [4j 

Using this simulation framework with the parameters laid out as above, we generated 
a data set consisting of 100, 000 simulated genes observed with 9 individuals per group 
with 5 timepoints at 0, 2, 6, 12 and 24 hours, following the same pattern of observations as 
the case study. 10% of genes were differentially expressed. We then used both the MESS 
and EDGE models to fit the data and identify differentially expressed genes. Figure |2] 
shows an example of a simulated gene with fitted mean and individual curves for both 
MESS and EDGE. In this instance EDGE's B-spline parameterisation seems sufficient for 
representing the mean curves but the scalar shifts do not model the data as closely as 
MESS does. Compare this simulated gene to a real example from the experimental data 
shown in Figure |4] This is the fit to the gene TNF, for which the raw data for the control 
group was given in Figure [T] We can see here that EDGE has selected too few knots to 
adequately capture the rapid spike in gene expression levels at 2 hours and that again 
the MESS model with individual curves provides a much closer fit to the data. Figure |3] 
gives the ROC curve for the simulation study based on 100, 000 simulated genes. At a 
fixed specificity of 90%, the corresponding power for MESS is 85.1% compared to 70.4% 
for EDGE. 
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Figure 4: BCG case study: a top scoring genes according to MESS (left), but not to EDGE 
(right). TNF has been strongly implicated in TB infection |3] and we would expect it to be 
ranked as highly significant. EDGE's low ranking can be partly explained by poor model 
selection failing to accurately capture the gene expression dynamics, and the inadequacy of 
scalar shifts to fully explain the variation between individuals. 



5 Experimental results 



We fit the MESS model to the BCG case study data and generated 100 bootstrap samples 
giving 3.2 million null genes from which to calculate empirical p-values based on the 
L2 distance as a test statistic. After correcting these p-values for multiple testing, 359 
probes were identified as being significantly differentially expressed, corresponding to 276 
unique genes. We provide here a brief summary of the results, leaving the full biological 
interpretation to a dedicated forthcoming publication. 

The top ten differentially expressed genes were found to be CCL20, PTGS2, SFRPl, 
ILIA, INHBA, FABP4, TNF, CXCL3, CCL19 and DHRS9. Many of these genes have 
previously been identified as being implicated in TB infection. For instance, CCL20 
was found to be upregulated in human macrophages infected with M. tuberculosis |10| 
and in vivo in patients suffering from pulmonary TB [7], while TNF-a has had a long 
association with the dieseae |4j. In total, using the GeneCards online database (www. 



genecEirds . org I, 58 of the 276 significant genes were found to have existing citations 
in the literature corresponding to M. tuberculosis infection or the BCG vaccine. Those 
which were upregulated mainly consisted of chemokines and cytokines such as CCLl, 
CCL2, CCL7, CCL18, CCL20, CXCLl, CXCL2, CXCL3, CXCL9, CXCLIO, TNF, CSF2, 
CSF3, IFNG, ILIA, ILIB, IL6 and IL8 while the downregulated genes were exemplified 
by transmembrame receptors CD86, CD163, TLRl, TLR4 and IL8RB. The large number 
of differentially expressed genes that we would have expected to identify lends credence 
to those genes without citations and whose role in the host response to BCG is currently 
unknown. 



8 



6 Conclusions 



In this work we have presented a non-parametric mixed-effects model based on smoothing 
splines for the analysis of longitudinal gene expression profiles. Experimental results based 
on both simulated and real data demonstrate that the use of both a flexible model that 
incorporates individual curves and an appropriate test-statistic yields higher statistical 
power than existing functional data analysis approaches. 
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